The Rabin Karp algorithm is frequently used to determine the similarity between texts, using a hash
function to compare the search string with the substrings in the text. The k value of the K-gram is often
chosen freely, and trying the candidate k values one by one when cutting terms takes a long time. This
research performs a word cutting test on a text using K-gram 0 to 8. The results cover the effect of each
k value on the similarity percentage produced. This research aims to determine the effect of the number of
K-grams on the performance of Rabin Karp in text matching. The test was run on 20 sentences in 10 trials,
using the Dice coefficient to measure text similarity. The research concludes that K-gram 0 to 2 should not
be used, because of the basic principle of the K-gram, character cutting: a cut of 0, 1, or 2 characters
does not yet carry meaning and therefore yields a high similarity percentage. Based on trials with K-gram 0
to 8 on 10 test data sets, K-gram 3 is the best among K-grams 0 to 8.

Keywords: K-gram, Performance, Rabin Karp, Similarity, Text adjustment

1. INTRODUCTION

One of the issues that has come with the advance of information and communication technology is
plagiarism. The internet and one-click access to information are often associated with the growth of
plagiarism [1]. According to Fan in Talib et al. [2], text mining is the process of extracting patterns from a
textual data source. Retrieving information from text is the main focus of text mining [3]. Determining the
level of similarity between texts, which can also be used to compare documents, requires testing with an
appropriate algorithm.

Algorithms for text matching are very diverse. One of them is the Rabin Karp algorithm, which is
used in text mining to match text or strings [4], [5]. This text matching uses a hash function to compare the
search string (m) with the substrings in the text (n) [6]. K-gram is a method that extracts series of K
consecutive characters from a word or a sequence of terms, reading the text continuously from the beginning
to the end of the document [7]. N-gram, base, and modulo affect the degree of similarity [8]. The K-gram
length is a determinant of the detected plagiarism level, so determining the right K-gram length produces
accurate results. Hashing is a means of converting strings to integers [9]. In addition to K-gram, document
matching can be done using the N-gram technique with the Rabin Karp method. N-gram is a method that takes
pieces of N characters from a sentence, based on the specified value of N [9], [10].
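
The K-gram cutting described above can be illustrated with a minimal Python sketch. The function name `kgrams` is hypothetical and not taken from the source, and the sketch assumes K is at least 1; it does not model the source's "K-gram 0" case.

```python
def kgrams(text: str, k: int) -> list[str]:
    """Return every chunk of k consecutive characters, read left to right."""
    if k < 1:
        # Assumption of this sketch: only positive chunk lengths are cut.
        raise ValueError("k must be at least 1")
    # A text shorter than k yields no chunks at all.
    return [text[i:i + k] for i in range(len(text) - k + 1)]


print(kgrams("gudeg", 3))  # ['gud', 'ude', 'deg']
```

A text of length L yields L - K + 1 chunks, which is why long cuts cannot be applied to very short texts.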
Many studies have used the Rabin Karp algorithm in various cases, such as detecting similarities among
participants' answers in an essay writing test [11]. In addition, with data taken from a website, the Rabin
Karp algorithm can be used to search for studio locations by generating the categories and list of rehearsal
studios [12]. The larger the file size, the longer it takes to look for similarity. If the file does not
undergo an indexing process, the time required is shorter but the similarity value decreases; the modulo
value affects the processing time but not the similarity value; and a smaller K-gram gives more accurate
similarity values than a larger K-gram [13]. The Winnowing algorithm can also be used to detect
plagiarism by searching for document fingerprints, converting N-gram sequences from the text into a set of
hashes [14]. A substantial number of applications use N-gram sequences, which weakens their
performance [15]. The studies mentioned above fixed the K-gram to be used, but the reasons for selecting the
k values have not been widely explained. The Rabin Karp algorithm can also be used for image or pattern
matching, such as fingerprint matching [16]. Rabin Karp also performs better than other algorithms on
semantic-based documents [17] and requires less time [18]. In addition to text, image, and pattern matching,
Rabin Karp can also be used to optimize the performance of parallel programming algorithms [19], [20].

The value produced by the K-gram is not always an accurate representation of the document [21].
The k value of the K-gram in word cutting is often chosen freely, and trying every possible k value one by
one when cutting words takes a long time. Studies that discuss the selection of the k value of the K-gram
are still limited in number. This study therefore tests word cutting in the text using K-gram 0 to 8, with 0
as the smallest cut and 8 as the longest. Cuts longer than 8 are possible, but not for every text, since this
depends on the number of characters in the text. The trial shows the effect of each k value on the
similarity percentage generated. The results are presented as an evaluation of the performance of the K-gram
in the Rabin Karp algorithm. The contribution of this study is to determine the effect of the number of
K-grams on the performance of the Rabin Karp algorithm in text matching.


2. METHOD 
This research consists of several steps, shown in Figure 1. Each step is explained below:

a) Identifying the existing problems, covering the background, problem formulation, problem limitation,
objectives, benefits, and the methodology used.

b) Conducting a literature study and literature review of references relevant to the research topic,
namely the K-gram in the Rabin Karp algorithm.

c) Preparing the input for sentence matching by taking two sample sentences.

d) Preprocessing, which is executed in several stages:

- Tokenization, the process of removing punctuation and converting the source text and the words to
be found into words without capital letters.

- Filtering, the deletion of frequently occurring words such as prepositions, conjunctions, and
pronouns, as well as affixes.

- Stemming, the process of converting words into their basic form.

e) Parsing. After preprocessing, the Rabin Karp algorithm starts by parsing the result, that is, cutting
the text into character chunks using the K-gram method.

f) Hashing, i.e. converting string characters to integers [22]. This process converts text into hash values
using ASCII codes. For example, take the Rabin Karp formula with a K-gram of 5 and the word cut
"gudeg". The hash value of the word is obtained with the following parameters:

- N-gram: 5
- Base: 10
- Modulo: 101

The ASCII values of the characters in the (lowercased) word gudeg are g: 103, u: 117, d: 100,
e: 101, g: 103. The hash is then calculated with (1) [8].


$$H = \left( \sum_{i=1}^{n} C_i \cdot b^{\,n-i} \right) \bmod q \qquad (1)$$

where:

H : hash value
C_i : ASCII value of the i-th character
n : n-gram length
b : base (a constant, typically a prime number)
q : modulo

$$H = \left( 103 \cdot 10^{5-1} + 117 \cdot 10^{5-2} + 100 \cdot 10^{5-3} + 101 \cdot 10^{5-4} + 103 \cdot 10^{5-5} \right) \bmod 101 = 47$$

g) Calculating the similarity using the Dice coefficient on the n-gram fingerprints [23], [24], as in (2).
This calculation is done after the matching process [25]. Fingerprint hashes are unique, non-duplicated
hashes.

$$S = \frac{2C}{A + B} \times 100\% \qquad (2)$$

where:

C : number of fingerprint hashes shared by text 1 and text 2
A and B : number of parsed words or fingerprint hashes in text 1 and text 2
S : similarity value
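
As a check on the two formulas, the following Python sketch evaluates (1) for the word "gudeg" with base 10 and modulo 101, and (2) for the counts reported later in Table 6 (12 shared fingerprints, 21 and 25 fingerprints per text). The function names `poly_hash` and `dice_similarity` are illustrative and not taken from the source.

```python
def poly_hash(chunk: str, base: int, modulo: int) -> int:
    """Equation (1): sum of C_i * base^(n - i) for i = 1..n, reduced mod q."""
    n = len(chunk)
    total = sum(ord(ch) * base ** (n - i) for i, ch in enumerate(chunk, start=1))
    return total % modulo


def dice_similarity(shared: int, count_a: int, count_b: int) -> float:
    """Equation (2): 2C / (A + B), expressed as a percentage."""
    return 2 * shared / (count_a + count_b) * 100


print(poly_hash("gudeg", base=10, modulo=101))  # 47
print(round(dice_similarity(12, 21, 25), 2))    # 52.17
```

The hash only reproduces 47 when the lowercase word is used, since the ASCII values listed in the example (103, 117, 100, 101, 103) are those of lowercase letters.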


Figure 1. Research flowchart (boxes: identifying the existing problems; literature study and review; input
data by taking two sample sentences; data preprocessing: tokenizing, filtering, stemming; Rabin Karp
algorithm using K-gram; similarity score using the Dice coefficient)


3. RESULTS AND DISCUSSION 

Plagiarism detection starts by producing a similarity percentage for the texts. The sample texts are in
Indonesian. Table 1 contains an example of the texts to be tested. After the two texts are input, the next
step is to preprocess them. Table 2 shows the result of text preprocessing, which consists of tokenizing,
filtering, and stemming.

Table 2 describes the results of the preprocessing stage carried out by the system. At the tokenizing
stage, words are separated in order so that they become tokens. At the filtering stage, frequently occurring
words in the form of prepositions, conjunctions, pronouns, and affixes are removed. Finally, stemming
converts the terms into their base form. The final output of preprocessing is shown in Table 3.
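
The three preprocessing stages can be sketched in Python as follows. The stop-word set `STOPWORDS` and the stemming lookup `STEMS` are hypothetical stand-ins that cover only the sample sentences in Table 1; the source does not say which word list or stemmer its system uses, and a real system would need a full Indonesian stop-word list and stemmer.

```python
import string

# Hypothetical stop words: just enough to cover the two sample sentences.
STOPWORDS = {"adalah", "merupakan", "dari"}
# Hypothetical stemming lookup standing in for a real Indonesian stemmer.
STEMS = {"makanan": "makan"}


def preprocess(text: str) -> str:
    """Tokenize, filter, and stem a sentence, then join the terms."""
    # Tokenizing: lowercase the text, strip punctuation, split into words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    # Filtering: drop frequently occurring words.
    kept = [token for token in tokens if token not in STOPWORDS]
    # Stemming: map each remaining word to its base form when one is known.
    stemmed = [STEMS.get(token, token) for token in kept]
    # The terms are concatenated, matching the form shown in Table 3.
    return "".join(stemmed)


print(preprocess("Gudeg adalah makanan khas Yogyakarta"))
# gudegmakankhasyogyakarta
```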


Table 1. Text similarity test data 1

Text 1   Gudeg adalah makanan khas Yogyakarta
         (Gudeg is Yogyakarta's signature dish)
Text 2   Gudeg merupakan ciri khas makanan dari Yogyakarta
         (Gudeg is a signature cuisine from Yogyakarta)

Table 2. Text preprocessing

Process             Result
Tokenizing Text 1   gudeg adalah makanan khas yogyakarta
                    (gudeg is yogyakarta signature dish)
Filtering Text 1    gudeg makanan khas yogyakarta
                    (gudeg yogyakarta signature dish)
Stemming Text 1     gudegmakankhasyogyakarta
                    (gudegyogyakartasignaturedish)
Tokenizing Text 2   gudeg merupakan ciri khas makanan dari yogyakarta
                    (gudeg is a signature cuisine from yogyakarta)
Filtering Text 2    gudeg ciri khas makanan yogyakarta
                    (gudeg signature cuisine yogyakarta)
Stemming Text 2     gudegcirikhasmakanyogyakarta
                    (gudegsignaturecuisineyogyakarta)


Table 3. Text preprocessing results

Text     Clause
Text 1   gudegmakankhasyogyakarta
         (gudegyogyakartasignaturedish)
Text 2   gudegcirikhasmakanyogyakarta
         (gudegsignaturecuisineyogyakarta)


Table 3 presents the results of preprocessing in the form of basic text. These results are the input to
the Rabin Karp algorithm, which has K-gram and hashing stages for comparing matching strings. For the sample
data, the matching is done using K-gram 4, as described in Tables 4, 5, and 6. In Table 4, text 1 and text 2
from the preprocessing results are cut into chunks of 4 characters using K-gram 4. The resulting chunks are
then hashed, that is, each character string is converted into an integer using ASCII codes. The results can
be seen in Table 5. After the matching process, the similarity is calculated using the Dice coefficient.
Table 6 shows the fingerprints shared by the hashes of text 1 and text 2, and the resulting similarity.
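
The K-gram 4 walk-through in Tables 4 to 6 can be reproduced with the Python sketch below. Two points are assumptions inferred from the reported numbers rather than stated by the source: the hashes in Table 5 are consistent with base 11 and no modulo reduction (for example, "gude" gives 152451), and the Dice counts are taken over deduplicated fingerprints. All function names are illustrative.

```python
def kgrams(text: str, k: int) -> list[str]:
    """Cut the text into overlapping chunks of k characters."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def hash_kgram(chunk: str, base: int = 11) -> int:
    """Polynomial hash over ASCII codes; base 11 and no modulo are assumptions."""
    n = len(chunk)
    return sum(ord(ch) * base ** (n - i) for i, ch in enumerate(chunk, start=1))


def fingerprints(text: str, k: int) -> set[int]:
    """Deduplicated hash values of all k-grams in the text."""
    return {hash_kgram(chunk) for chunk in kgrams(text, k)}


def dice_similarity(fp_a: set[int], fp_b: set[int]) -> float:
    """Dice coefficient as a percentage; 0.0 when both sets are empty."""
    total = len(fp_a) + len(fp_b)
    if total == 0:
        return 0.0
    return 2 * len(fp_a & fp_b) / total * 100


text_1 = "gudegmakankhasyogyakarta"
text_2 = "gudegcirikhasmakanyogyakarta"
fp_1 = fingerprints(text_1, 4)
fp_2 = fingerprints(text_2, 4)

print(hash_kgram("gude"))                      # 152451
print(len(fp_1), len(fp_2), len(fp_1 & fp_2))  # 21 25 12
print(round(dice_similarity(fp_1, fp_2), 2))   # 52.17
```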


Table 4. Results with K-gram

Text     K-gram partition
Text 1   {gude} {udeg} {degm} {egma} {gmak} {maka} {akan} {kank} {ankh} {nkha} {khas} {hasy} {asyo}
         {syog} {yogy} {ogya} {gyak} {yaka} {akar} {kart} {arta}
Text 2   {gude} {udeg} {degc} {egci} {gcir} {ciri} {irik} {rikh} {ikha} {khas} {hasm} {asma} {smak}
         {maka} {akan} {kany} {anyo} {nyog} {yogy} {ogya} {gyak} {yaka} {akar} {kart} {arta}


Table 5. K-gram hashing result

Text     Hashing
Text 1   152451 169041 146563 148190 151456 158090 143231 155471 143698 160598 156183 151547 144464 169030
         175736 161632 152908 174062 143235 155524 144274
Text 2   152451 169041 146553 148088 150341 145833 154811 165720 153943 156183 151535 144318 167428 158090
         143231 155485 143859 162375 175736 161632 152908 174062 143235 155524 144274


Table 6. Fingerprint results and similarity level 


Process Result 
Fingerprint 152451 169041 158090 143231 156183 175736 161632 152908 174062 143235 155524 144274 
Similarity 52.17% 


The researchers added further test data using several texts with K-gram 0 to 8; the data are in Table 7
(in Appendix). The texts used in this paper are in Indonesian. The sample consists of 10 sets of sentence
test data with different sentence lengths, each matched with K-gram 0 to 8. This test uses 20 sentences in
10 trials. Table 7 (in Appendix) presents the similarity value for each K-gram. The similarity percentages
show that similarity is still detected as K increases, but the percentage value gets lower. Testing 2 stops
at K-gram 5 and testing 4 stops at K-gram 7, since no identical fingerprint value is found beyond those
points, so no similarity value exists. K-gram 0 gives 100% similarity because no characters are cut from the
words, while K-gram 1 and 2 often give similarity percentages close to 100% because cuts of 1 and 2
characters do not yet carry meaning, so the similarity value is high. From the 10 tests performed, there was
a very significant difference between the values of K-gram 2 and 3. K-gram 3 also gave a consistent
percentage value in each test. Testing with K-grams 4 to 8 shows that they lower the similarity percentage.
The smaller the similarity percentage, the lower the ability to detect similarities between texts, and vice
versa.
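
The effect of K on the similarity value can be explored with a short Python sketch that sweeps K over the preprocessed sample pair from Table 3. It rests on the same assumptions inferred from Table 5 and Table 6 (base 11, no modulo, deduplicated fingerprints), so its values are illustrative and are not expected to match Table 7, which uses other texts. K = 0 is left out because the source treats it as "no cutting" rather than as a chunk length. The name `similarity_by_k` is hypothetical.

```python
def poly_hash(chunk: str, base: int = 11) -> int:
    """Polynomial hash over ASCII codes; base 11 and no modulo are assumptions."""
    n = len(chunk)
    return sum(ord(ch) * base ** (n - i) for i, ch in enumerate(chunk, start=1))


def fingerprints(text: str, k: int) -> set[int]:
    """Deduplicated hashes of every k-character chunk of the text."""
    return {poly_hash(text[i:i + k]) for i in range(len(text) - k + 1)}


def similarity_by_k(text_a: str, text_b: str, k_values) -> dict[int, float]:
    """Dice similarity (in percent) of two texts for each K in k_values."""
    results = {}
    for k in k_values:
        fp_a, fp_b = fingerprints(text_a, k), fingerprints(text_b, k)
        total = len(fp_a) + len(fp_b)
        # With no fingerprints at all, report 0.0 instead of dividing by zero.
        results[k] = 2 * len(fp_a & fp_b) / total * 100 if total else 0.0
    return results


results = similarity_by_k(
    "gudegmakankhasyogyakarta", "gudegcirikhasmakanyogyakarta", range(1, 9)
)
for k, value in results.items():
    print(f"K-gram {k}: {value:.2f}%")
```

For this pair, K = 4 gives the 52.17% reported in Table 6, and single-character cuts score higher than the 4-character cuts, in line with the discussion above.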


4. CONCLUSION 

The similarity of texts can be seen from the results of their matching. K-gram 0 to 2 should not be
used, because of the basic principle of the K-gram, character cutting: a cut of 0, 1, or 2 characters does
not yet carry meaning and therefore yields a high similarity percentage. Based on the trials that have been
done, this research recommends K-gram 3 as the best value among K-gram 0 to 8. The researchers only took
K-gram 0 to 8 test samples with 10 test data sets, since a cut of 3 characters already carries meaning.

APPENDIX

Table 7. Test results for several texts and their similarity

Columns: Testing, Text 1, Text 2, K-gram, Similarity results

Testing 1
Text 1 (Indonesian): Mikroorganisme patogen meginfeksi sel makhluk hidup dapat disebut dengan virus. Virus
memiliki kemampuan untuk mereplikasi diri ke dalam sel makhluk hidup. Virus tidak memiliki perlengkapan
seluler sehingga tidak dapat bereproduksi sendiri.
Text 1 (English): Pathogenic microorganisms that infect living cells are called viruses. Viruses can
replicate themselves into living cells. Viruses do not have cellular structure, therefore, they cannot
reproduce on their own.
Text 2 (Indonesian): Virus adalah Mikroorganisme patogen meginfeksi sel makhluk hidup yang memiliki
kemampuan untuk mereplikasi diri ke dalam sel makhluk hidup karena tidak memiliki perlengkapan seluler
sehingga tidak dapat bereproduksi sendiri.
Text 2 (English): Viruses are pathogenic microorganisms that infect living cells that can replicate
themselves into living cells because they do not have cellular structure, therefore, they cannot reproduce
on their own.
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  97.30%  94.37%  90.61%  87.83%  84.82%  81.87%  79.38%  76.92%

Testing 2
Text 1 (Indonesian): Keahlian untuk membuat karya yang bermutu disebut dengan seni
Text 1 (English): The skill to create quality works is called art.
Text 2 (Indonesian): Sebuah karya pengungkapan rasa dan keindahan yang menyajikan kreatifitas disebut
dengan seni
Text 2 (English): A work of expressing taste and beauty that shows creativity is called art
K-gram       0        1       2       3       4       5      6  7  8
Similarity   100.00%  75.86%  37.50%  20.83%  13.04%  4.55%  -  -  -

Testing 3
Text 1 (Indonesian): Data mining yaitu sekumpulan data yang diproses sedemikian rupa untuk mendapatkan
nilai tambah berupa pengetahuan
Text 1 (English): Data mining is a collection of data that is processed in such a way to obtain added
value in the form of knowledge.
Text 2 (Indonesian): Data mining yaitu sekumpulan data dalam jumlah besar atau kompleks yang dianalisis
secara otomatis utnuk menemukan pola atau kecenderungan yang penting dan terkadang tidak disadari
keberadaannya
Text 2 (English): Data mining is a large or complex collection of data that is automatically analyzed to
find important and unknown patterns or trends
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  90.91%  45.16%  30.63%  28.07%  28.07%  26.79%  25.45%  24.07%

Testing 5
Text 1 (Indonesian): Di dalam tata surya terdapat kumpulan benda langit yaitu sebuah matahari dan
benda-benda langit lain yang terikat oleh gaya gravitasi
Text 1 (English): In the solar system, there is a collection of celestial bodies, namely the sun and other
celestial bodies that are bound by the force of gravity.
Text 2 (Indonesian): Di dalam tata surya terdapat kumpulan benda langit yaitu sebuah matahari dan semua
objek yang terikat oleh gaya gravitasinya
Text 2 (English): In the solar system, there is a collection of celestial bodies, namely the sun and all
objects that are bound by its gravitational force.
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  94.74%  90.53%  88.50%  85.47%  82.05%  78.63%  75.86%  73.04%

Testing 6
Text 1 (Indonesian): Mie ayam atau bakmi ayam adalah masakan indonesia
Text 1 (English): Mie Ayam or chicken noodle is Indonesian cuisine
Text 2 (Indonesian): Mie ayam merupakan salah satu masakan khas indonesia
Text 2 (English): Chicken noodle is one of Indonesia's signature cuisines
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  86.96%  79.17%  53.85%  43.14%  32.00%  25.00%  17.39%  9.09%

Testing 7
Text 1 (Indonesian): Daring merupakan proses pertukaran informasi antar komputer yang telah terhubung
melalui internet
Text 1 (English): Online is the process of exchanging information between computers connected to the
internet
Text 2 (Indonesian): Daring merupakan proses pembelajaran atau bertukar informasi melalui hubungan sebuah
internet
Text 2 (English): Online is a process of learning or exchanging information through an internet connection
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  97.30%  83.54%  72.73%  66.67%  62.22%  56.82%  51.16%  45.24%

Testing 8
Text 1 (Indonesian): Biji kopi yang disangrai kemudian dihaluskan sehingga menjadi bubuk kopi dapat
dinikmati dengan menyeduhnya
Text 1 (English): The roasted coffee beans are then ground into coffee grounds that can be enjoyed by
brewing them.
Text 2 (Indonesian): Cara menikmati kopi yaitu dengan menyeduh biji kopi yang disangrai kemudian
dihaluskan sehingga menjadi bubuk
Text 2 (English): The way to enjoy coffee is to brew coffee beans that have been roasted and then turning
them into coffee grounds
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  97.14%  87.10%  79.41%  70.59%  64.71%  59.70%  58.46%  57.14%

Testing 9
Text 1 (Indonesian): Kementerian Industri dan Teknologi Informasi China mengatakan melalui situs resminya
bahwa awal bulan Maret, perusahaan-perusahaan ventilator di China telah mengirimkan sekitar 14.000
ventilator non-invasif dan 2.900 invasif ke Kota Hubei, China
Text 1 (English): China's Ministry of Industry and Information Technology said on its official website
that in early March, ventilator companies in China had shipped around 14,000 non-invasive and 2,900
invasive ventilators to Hubei City, China.
Text 2 (Indonesian): Sebanyak 14.000 ventilator non-invarsif dan 2.900 invarsif telah di kirimkan ke kota
Hubei oleh perusahaan-perusahaan ventilator di china.
Text 2 (English): A total of 14,000 non-invasive and 2,900 invasive ventilators have been shipped to Hubei
city, by ventilator companies in China.
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  94.44%  72.41%  54.30%  46.91%  41.67%  36.78%  32.95%  29.21%

Testing 10
Text 1 (Indonesian): Angka kematian di Italia berjumlah 6,077 kematian dari 63,927 kasus atau setara dengan
9,51 persen, hal ini berbanding terbalik dengan jumlah kasus dan kematian di china.
Text 1 (English): The death rate in Italy is 6,077 deaths of 63,927 cases, equivalent to 9.51 percent, this
is inversely proportional to the number of cases and deaths in China.
Text 2 (Indonesian): Jumlah kasus kematian di cina berbanding terbalik dengan italia yang berjumlah 6.007
kematian dari 63.927 kasus yaitu sebesar 9.51 persen.
Text 2 (English): The number of deaths in China is inversely proportional to Italy which was amounted to
6,007 deaths of 63,927 cases, equivalent to 9.51 percent.
K-gram       0        1       2       3       4       5       6       7       8
Similarity   100.00%  96.77%  76.67%  62.86%  52.63%  44.16%  34.21%  27.03%  22.22%